Method of specifying SNP

ABSTRACT

The present invention is intended to provide a technique relating to a method of specifying an SNP which comprises repeatedly estimating an SNP serving as a marker and performing detailed typing of the SNPs around it, thus gradually narrowing down the focus to the base sequence domain in which the ‘target’ SNP is likely contained and finally specifying the ‘target’ SNP at a high efficiency. As FIG. 1 shows, the method of specifying an SNP comprises: (1) determining a drug to be developed which is the subject of the determination; (2) collecting samples to be analyzed; (3) determining a ‘scanning domain (base sequence domain)’; (4) determining ‘typing’ SNPs; (5) SNP typing by the wet process and analyzing haplotypes based on the typing data; (6) estimating a ‘marker’ SNP (determining the analytical data); and (7) specifying the ‘target’ SNP (target SNP). A cycle consisting of stages (1) to (7) is repeated as a treatment cycle.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates to a method of specifying SNP for drug responsiveness that employs SNP function analysis techniques for the improvement of the efficiency and accuracy of the SNP function analysis process, and for use in clinical trials of newly developed drugs by the drug industry.

2. Description of the Related Art

In the conventional SNP function analysis process that specifies the SNP (Single Nucleotide Polymorphism: a single nucleotide polymorphism or the position of a single nucleotide polymorphism) related to disease susceptibility or drug responsiveness, SNP typing by a wet process (Note 1) was performed after narrowing down the SNPs to be analyzed to several tens or several thousands of locations because of cost.

FIG. 7 is a drawing showing the flow of the conventional SNP function analysis process. As shown in FIG. 7, conventional SNP function analysis was performed in the order of preliminary step A (determining a new drug to be developed), preliminary step B (collecting samples to be analyzed), step A (setting typing SNP), step B (SNP typing by a wet process), step C (data analysis) and step D (specifying the ‘target’ SNP).

This is because the cost (chemical costs per sample) of typing all of the 3 million human SNPs using the TaqMan method (Note 2) is approximately 200 million yen, and performing SNP typing for the several hundred or several thousand samples necessary for statistical analysis of the SNP function would require unrealistic costs of several tens to several hundreds of billions of yen, as well as resources such as large-scale analysis facilities.
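The order of magnitude stated above can be checked with a short calculation. The following is only a sketch: the sample counts of 300 and 3,000 are illustrative assumptions standing in for "several hundred" and "several thousand", and the function name is hypothetical.

```python
# Rough cost check; the sample counts used below are illustrative assumptions.
COST_PER_SAMPLE_YEN = 200_000_000  # approx. chemical cost of typing ~3 million SNPs for one sample


def total_cost_yen(num_samples: int) -> int:
    """Chemical cost of typing every SNP for num_samples samples."""
    return COST_PER_SAMPLE_YEN * num_samples


for n in (300, 3_000):
    print(f"{n} samples: {total_cost_yen(n) / 1e9:.0f} billion yen")
```

With these assumed counts the totals fall in the range of tens to hundreds of billions of yen, which is why the typing SNPs have to be limited.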

Therefore, in the normal SNP function analysis process, the SNPs to be typed (hereafter referred to as typing SNPs) are limited, and function analysis is performed after narrowing down the SNPs to 1,000 to 10,000 SNPs.

However, there is no other method to determine whether or not there is a relationship between disease susceptibility or drug responsiveness and SNP than by statistical determination from the results of typing those SNP. Therefore, the ‘target’ SNPs (Note 3), which are finally determined to be related, must be included in and selected from a group of 1,000 to 10,000 SNP beforehand as typing SNP. In the case that these SNP are not selected, the related SNP cannot be found in the analysis, and so the analysis process must be performed again from selection of a group of typing SNP.

In the conventional method of selecting typing SNP, the researcher used a technique of searching reference documents such as research papers and genome-related databases, and performing a homology search that predicts the function of human genes from similar genes in non-human genomes whose functions are already known.

However, the functions of human genomes are not completely given in this genome information. Therefore, the step of selecting typing SNP, which determines the efficiency of this SNP function analysis process, or in other words, whether or not it is possible to predict a ‘target’ SNP at a high probability, depends largely on the experience and skill of the researcher as well as luck.

Also, one more problem with the SNP function analysis process is the quality of data. In the SNP function analysis, SNP typing is performed between sample groups that are classified according to whether or not they express certain characteristics (for example, behavior or susceptibility), the allele frequency of each of the SNP of both groups is analyzed statistically, and the SNP that causes that characteristic to be expressed is identified. In other words, when there is a problem with the quality of typing data of the wet process, the results of SNP function analysis based on that data become inaccurate.

This problem is due to the fact that SNP typing is a process with human intervention. Many of the quality problems that are inherent in the SNP function analysis process and that cause a drop in the quality of the data, such as contamination and careless mistakes in operation such as mixing up samples and reagents, are human related, and quality is largely dependent on these and on the experience and skill of the researcher.

(Note 1) The wet process is a process for performing SNP typing. In the current TaqMan method, gene samples of blood or the like are caused to react with a reagent on a plate and hybridization is performed, the results are optically measured, and then finally, the allele of each sample is typed for that SNP. This process is called a wet process. Statistical analysis of the specified typing data is not included in the wet process.

(Note 2) This is a typing method that uses the PCR (Polymerase Chain Reaction) between a fluorescent-labeled allele-specific oligo and Taq DNA polymerase.

(Note 3) The ‘target’ SNPs or SNPs that will become the ‘target’ are either SNPs that cause the disease susceptibility or drug responsiveness (of newly developed drugs), or SNPs that are indicators or markers of disease susceptibility or drug responsiveness. The object of the SNP function analysis is to specify these SNPs.

However, there were the following problems in the conventional technology.

There are problems in that predicting ‘target’ SNPs before typing and properly and accurately selecting a group of several hundred or several thousand SNP that includes them is difficult, and in that preventing occurrences of mixed up samples or reagents and contamination due to human error during the wet process, which lower the quality of the SNP typing data, is extremely difficult.

SUMMARY OF THE INVENTION

Taking the aforementioned problems into consideration, the object of the present invention is to provide a method of specifying SNP that 1) gradually narrows down the base sequence domain in which it is thought that the ‘target’ SNP exists, and finally specifies the ‘target’ SNP efficiently by repeatedly estimating a SNP as a marker and performing detailed typing of the SNPs near that marker, and 2) compares statistics for patients and non-patients and narrows down the SNP domain by a process-control method that prevents careless mistakes in operation that cause a drop in data quality in the wet process and by eliminating any data affected by contamination or the like before performing statistical analysis.

The invention according to a first claim is a method of specifying SNP related to disease susceptibility or drug responsiveness and comprising: a first step of setting a scanning domain beforehand in the base sequence domain that is the object of SNP analysis; a second step of gradually narrowing down the scanning domain to a localized domain that contains a target SNP; and a third step of specifying the target SNP from the narrowed down localized domain.

The invention according to a second claim is the method of specifying SNP of claim 1 in which the second step comprises a step of setting a marker SNP for specifying the target SNP and gradually narrowing down the scanning domain.

The invention according to a third claim is the method of specifying SNP of the second claim in which the second step uses statistical analysis such as haplotype analysis to set the marker SNP.

The invention according to a fourth claim is the method of specifying SNP of claim 3 in which the first step comprises: a step of setting the scanning domain of the base sequence domain in a genome domain that is limited to genes whose functions are clearly known or chromosomes whose functions can be predicted; and the second step comprises: a fourth step of selecting a group of SNP to be typed from the scanning domain and performing SNP typing using a wet process; a fifth step of finding the probability of appearance of all combinations of the haplotype analysis in the scanning domain based on typing data of the SNP typing as a statistical amount; and a sixth step of comparing the found statistical amount with a preset or estimated reference statistical amount, and when there is significant deviation between the statistical amount and the reference statistical amount that exceeds a preset threshold, determining that the marker SNP is contained in the domain corresponding to the deviated position that exceeds the threshold value.

The invention according to a fifth claim is the method of specifying SNP of claim 4 in which the third step comprises: a seventh step of increasing the specified ratio of the number of SNPs to be the object of typing in the selection of the SNP group in the fourth step when the significant deviation is less than a first threshold value, and then repeating the fifth step; an eighth step of setting a new scanning domain from the scanning domain that has been decreased by a specified ratio such that it contains the position of the deviated peak when the significant deviation is greater than the first threshold value but less than a second threshold value, and then repeating the fifth step; and a ninth step of determining that the marker SNP is contained in the domain corresponding to the deviated position that exceeds the second threshold value when the significant deviation exceeds the second threshold value, setting a new scanning domain from the scanning domain that has been decreased by a specified ratio such that it contains the position of the deviated peak, and then repeating the fifth step.

The invention according to a sixth claim is the method of specifying SNP of claim 5 in which the ninth step comprises a step of setting SNPs that include the target SNP for which all DNA samples are typed when the number of SNPs in a selected group is less than a specified number.

The invention according to a seventh claim is the method of specifying SNP of claim 5 in which the seventh step comprises a step of determining that the target SNP is not contained and stopping the process when the number of times the process of the fifth step is performed exceeds a specified number of times.

The invention according to an eighth claim is the method of specifying SNP of claim 5 in which the eighth step comprises a step of determining that the target SNP is not contained and stopping the process when the number of times the process of the fifth step is performed exceeds a specified number of times.
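The branching described in the fifth through eighth claims can be sketched as a single decision function. This is a minimal illustration and not the claimed implementation: the names `Action` and `next_action` are hypothetical, and the handling of a deviation exactly equal to a threshold is an assumption, since the claims do not state it.

```python
from enum import Enum


class Action(Enum):
    INCREASE_SNP_RATIO = "type a larger share of SNPs in the same scanning domain"
    NARROW_DOMAIN = "set a smaller scanning domain around the deviated peak"
    NARROW_WITH_MARKER = "marker SNP located; set a smaller domain around it"
    TYPE_ALL_SNPS = "few SNPs remain; type all of them to find the target SNP"
    STOP = "cycle limit exceeded; target SNP judged not to be contained"


def next_action(deviation, first_threshold, second_threshold,
                cycles_done, max_cycles, snps_in_group, min_snps):
    """Choose the next processing step from the observed deviation."""
    if deviation > second_threshold:
        # Ninth step, with the sixth-claim shortcut for small SNP groups.
        if snps_in_group < min_snps:
            return Action.TYPE_ALL_SNPS
        return Action.NARROW_WITH_MARKER
    if cycles_done > max_cycles:
        # Seventh and eighth claims: give up after too many repetitions.
        return Action.STOP
    if deviation < first_threshold:
        return Action.INCREASE_SNP_RATIO  # seventh step
    return Action.NARROW_DOMAIN  # eighth step (assumed to include equality)
```

For example, a deviation below the first threshold on an early cycle returns `Action.INCREASE_SNP_RATIO`, while the same deviation after the cycle limit has been exceeded returns `Action.STOP`.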

The invention according to a ninth claim is the method of specifying SNP of any one of the claims 1 through 8 in which the second step comprises a step of typing the SNP using a quality controlled process, and where the quality controlled process performs typing of four SNP on one assay plate for one sample, and determines that the typing data is invalid when the number of typed SNPs found having significant difference by a statistical method such as the Chi-square test exceeds a specified number and identifies the data as being affected by contamination of the sample.

The invention according to a tenth claim is the method of specifying SNP of claim 9 in which the second step repeats SNP typing only for SNP found to have significant difference when the number of typed SNPs found to have significant difference was a specified number, and when the result of no significant difference continues for a specified number of times, determines that the typing data is correct and uses that data.

The invention according to an eleventh claim is a computer program that can be read by a computer that can execute the processing of the method of specifying SNP of any one of the claims 1 through 10 in which all of the steps of any one of the claims 1 through 10 are coded.

BRIEF EXPLANATION OF THE DRAWINGS

FIG. 1 is a process flowchart showing the method of specifying SNP of a first embodiment of the invention.

FIGS. 2A to 2C are drawings showing steps S3, S4 and S6 of FIG. 1 in detail.

FIG. 3 is a drawing showing an example of the sample and reagent tubes,various plates and related data schemes for step S5 in FIG. 1.

FIG. 4 is a drawing showing an example of the arrangement pattern of SNP on the assay plate 10AP in FIG. 3.

FIG. 5 is a drawing showing an example of the analysis data of step S7 (step 5) in FIG. 1.

FIG. 6 is a drawing showing an example of the analysis data of step S7 in FIG. 1.

FIG. 7 is a drawing showing the process flow of conventional SNP function analysis.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

The preferred embodiments of the present invention are explained below based on the drawings.

(Embodiment 1)

FIG. 1 is a process flowchart showing the method of specifying SNP of a first embodiment of the invention. As shown in FIG. 1, the method of specifying SNP of this first embodiment comprises: a step (step S1) of determining a drug to be newly developed for which the SNP will be specified, a step (step S2) of collecting samples to be analyzed, a step (step S3) of determining the ‘scanning domain (base sequence domain)’, a step (step S4) of determining ‘typing’ SNP, a step (step S5) of performing SNP typing by a wet process, a step (step S6) of analyzing haplotypes based on the typing data, a step (step S7) of estimating a ‘marker’ SNP (determining analysis data) and a step (step S8) of specifying a ‘target’ SNP (target SNP), and wherein steps S3 to S7 are repeated as one cycle (processing cycle).

In this process model, whose object is specifying SNP related to disease susceptibility or drug responsiveness, the ‘scanning domain’ for which SNP typing is performed is gradually narrowed down by performing the following eight processes (steps), and finally the ‘target’ SNP related to whether or not there is drug responsiveness to the newly developed drug is specified.

The eight steps are described below. First, preliminary step 1 is a step (step S1) of determining a drug to be newly developed for which the SNP will be specified. Preliminary step 2 is a step (step S2) of collecting samples to be analyzed. Step 1 is a step (step S3) of determining the ‘scanning domain’ (sets the scanning domain beforehand). Step 2 is a step (step S4) of determining ‘typing’ SNP (typing SNP). Step 3 is a step (step S5) of performing SNP typing by a wet process. Step 4 is a step (step S6) of analyzing haplotypes based on the typing data. Step 5 is a step (step S7) of estimating a ‘marker’ SNP (determining analysis data). Step 6 is a step (step S8) of specifying a ‘target’ SNP. Of these eight steps, step 1 to step 5 (steps S3 to S7) are repeated as one cycle (processing cycle).

Next, the processing by each of the steps will be explained in detail with reference to FIG. 1.

Preliminary step 1 (determining a newly developed drug for SNP specification): A drug is selected from among new drugs developed by the drug industry for which the drug responsiveness (whether or not the drug is effective or has side effects) of the newly developed drug has been tested using SNP. Also, in this process model, it is possible to check the relationship with SNP using disease susceptibility as the object, for example.

Preliminary step 2 (collecting samples to analyze): In the SNP function analysis, SNP data are compared between sample groups that are separated according to whether or not a certain characteristic is expressed, and the SNP that causes that characteristic to be expressed is specified (step S2). For example, in the case of checking the relationship between susceptibility to diabetes and SNP, the allele frequency for each SNP is statistically compared between a group of diabetes patients and a control group. At this time, the control group used can be either a group of patients not having diabetes, or an average sample group extracted at random (regardless of whether or not the patients have diabetes).

In the case of specifying SNP related to the drug responsiveness of a newly developed drug, analysis beginning from step 1 is performed for a group for which the drug was effective or had side effects, and a group for which the drug was not effective or did not have side effects.

When there is an average sample group that was extracted at random, or when it is possible to use outside SNP data that corresponds to these groups, together with the aforementioned two groups, it is possible to compare and analyze data between a total of three groups and thus more effectively perform analysis.

(c) Step 1 (Determining the ‘scanning domain’ (base sequence domain)): In this process model, by repeating the process from this step 1 to step 5 (to be explained later) as one cycle, the ‘scanning domain’ is gradually narrowed down from an initially large ‘scanning domain’ to a more localized ‘scanning domain’. In the last cycle, by analyzing all of the SNP typing data existing in the ‘scanning domain’, the ‘target’ SNP is finally specified. First, the scanning domain is determined (step S3). This ‘scanning domain’ is a domain that is checked (scanned) for the existence of a ‘target’ SNP and is a continuous domain of a human genome base sequence. The physical length is variable as it is gradually narrowed down.

FIGS. 2A to 2C are drawings to explain steps S3, S4 and S6 in FIG. 1. The steps in the processing cycle will be explained in more detail with reference to FIG. 1 and FIGS. 2A to 2C.

FIG. 2A shows an example of a genome domain that contains SNP 10D in step S3 in FIG. 1. In step S3, the first ‘scanning domain’ is set at a coarse level, such as genes or, larger still, chromosomes. This is because even in this state large functions on the chromosome level are clear. Also, this includes the method of analyzing all of the chromosomes as the ‘scanning domain’ in the case when a plurality of chromosomes are the cause and it is not clearly known which chromosomes are suspect (include the target SNP), or in the case of taking all of the chromosomes except for certain chromosomes as the object (when there is no difference in the result between male and female, the sex chromosomes are meaningless, so measures are taken such as removing them from the ‘scanning domain’), or as an extreme example, in the case when there is absolutely no information related to narrowing down the target SNP. Moreover, more specifically, it is possible to set the initial ‘scanning domain’ at the gene level, for example. In other words, the scanning domain (primary scanning domain, initial scanning domain) is set based on a chromosome level for which the functions are known in advance.

From the second cycle on (the (n+1)th cycle, where n is an integer of 1 or greater), the process returns to step 1 (step S3) from step 5 (step S7) of the previous (nth) cycle, so the domain in which chain imbalance (linkage disequilibrium) was found in step 5 of the nth processing cycle is set as the new ‘scanning domain ((n+1)th scanning domain)’ in the (n+1)th processing cycle. At this time, the narrowed-down new ‘scanning domain ((n+1)th scanning domain)’ has a length ranging from a large fraction to a small fraction of the ‘scanning domain (nth scanning domain)’ of the previous (nth) cycle. How much the domain is narrowed down in the next cycle depends on the intensity of the chain imbalance between SNPs (described later).

(d) Step 2 (Determining the typing SNP): Selects a group of SNPs for typing from among the ‘scanning domain’ set in step 1.

FIG. 2B shows an example of an SNP group that was selected from FIG. 2A. The SNPs contained in this group can be arbitrarily selected (without paying attention to separation according to gene site function or exon-intron); however, the SNPs should be selected such that the interval between SNP positions is as uniform as possible. This is in order to be able to indirectly observe chain imbalances between SNP positions in this series of analyses, and in order to be able to remove errors due to differences in physical distance between SNP positions that have a large effect on chain imbalances.
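One simple way to obtain roughly uniform intervals is to divide the scanning domain into equal-width bins and take the known SNP nearest to the centre of each bin. The following is a minimal sketch under that assumption; the function name `pick_evenly_spaced` and the bin-centre rule are illustrative and are not taken from the embodiment.

```python
from bisect import bisect_left


def pick_evenly_spaced(positions, start, end, count):
    """Pick up to `count` SNP positions in [start, end] with near-uniform spacing."""
    candidates = sorted(p for p in positions if start <= p <= end)
    if count <= 0 or not candidates:
        return []
    step = (end - start) / count
    chosen = []
    for i in range(count):
        ideal = start + (i + 0.5) * step  # centre of the i-th equal-width bin
        j = bisect_left(candidates, ideal)
        # Only the candidates just below and just above the ideal point can be nearest.
        neighbours = candidates[max(j - 1, 0): j + 1]
        best = min(neighbours, key=lambda p: abs(p - ideal))
        if best not in chosen:  # sparse regions may yield fewer than `count` SNPs
            chosen.append(best)
    return chosen
```

Because gene function and exon-intron boundaries are deliberately ignored, only base positions are needed as input.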

Chain imbalances between SNPs that can be analyzed are thought to occur when the physical distance between SNP positions is about 10,000 to 100,000 nucleotide bases. Therefore, when the SNPs contained in the first ‘scanning domain’ cover the entire length of the approximately 100,000 chromosomes, it is preferred that the first typing be performed for about 1,000 SNPs.

From the second processing cycle on, the ‘scanning domain’ is gradually narrowed down, and the physical distance of the ‘scanning domain’ becomes shorter, and thus the number of SNPs contained in this range decreases. The typing SNPs are selected in a range from several tens to several hundred from this ‘scanning domain’.

FIG. 3 is a drawing showing an example of the sample and reagent tubes,various plates and related data schemes for step S5 in FIG. 1.

(e) Step 3 (SNP typing by a wet process): Performs the SNP typing by a wet process shown in FIG. 1 (step S5). In other words, SNP typing of each sample is performed for the selected SNP group by the TaqMan PCR method or the like. In order to prevent data errors due to contamination or handling of the samples during this typing process, quality of the typing data is assured by performing 1) generation management of the sample and reagent tubes and assay plates using barcodes, and 2) inspection of typing data using ‘Hardy-Weinberg Equilibrium’.

1) Generation management of the sample and reagent tubes and assay plates using barcodes: The most common mistake made in the typing process is mishandling of samples and reagents. In the current SNP analysis process, some intermediate plates are created before finally creating the assay plate 10AP that will be used in the typing apparatus for performing the assay. Therefore, it is important that these plates also be managed such that the plates are created from proper plates.

Therefore, together with using barcodes to perform ID management of the plates created by dispensing from the sample and reagent tubes, generation management is performed between tube and plate, or plate and plate, to control the relationship of which was created as a parent and which was created as a child.
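The parent and child relationship between racks and plates can be recorded as a small registry keyed by barcode. This is a minimal sketch only: the class name `GenerationRegistry`, its methods and the barcode strings are hypothetical and are not part of the described apparatus.

```python
class GenerationRegistry:
    """Records which barcoded rack or plate each plate was dispensed from."""

    def __init__(self):
        self._parents = {}  # barcode -> tuple of parent barcodes

    def register(self, barcode, parents=()):
        """Register an item; every parent must already be known."""
        if barcode in self._parents:
            raise ValueError(f"duplicate barcode: {barcode}")
        unknown = [p for p in parents if p not in self._parents]
        if unknown:
            raise KeyError(f"unregistered parent(s): {unknown}")
        self._parents[barcode] = tuple(parents)

    def ancestors(self, barcode):
        """Return every rack or plate the given item descends from."""
        found = set()
        stack = list(self._parents[barcode])
        while stack:
            item = stack.pop()
            if item not in found:
                found.add(item)
                stack.extend(self._parents[item])
        return found


registry = GenerationRegistry()
registry.register("SAMPLE-RACK-1")
registry.register("REAGENT-RACK-1")
registry.register("MASTER-1", parents=["SAMPLE-RACK-1"])
registry.register("REAGENT-PLATE-1", parents=["REAGENT-RACK-1"])
registry.register("ASSAY-1", parents=["MASTER-1", "REAGENT-PLATE-1"])
print(sorted(registry.ancestors("ASSAY-1")))
```

Refusing to register a plate whose parent is unknown is the point of the design: a plate dispensed from the wrong source cannot enter the data without an error being raised.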

Sample: The sample is managed by sample tubes marked underneath with a 2-dimensional barcode, and by a sample rack that can store up to 96 of these sample tubes. The arrangement of the sample tubes in the sample rack is managed as data by reading the barcodes on the sample tubes from underneath the sample rack with a scanner. Also, a barcode is added to the sample rack itself.

Reagent: The reagent is managed by reagent tubes marked underneath with a 2-dimensional barcode, and by a reagent rack that can store up to 96 of these reagent tubes. Similar to the sample tubes, the arrangement of the reagent tubes in the reagent rack is managed as data by reading the barcodes with a scanner, and a barcode is added to the reagent rack itself.

Plates: There are three kinds of plates: master plates, reagent plates and the assay plate 10AP. A master plate has 96 wells per plate (depressions where the sample or reagent is dispensed), and the sample is dispensed onto the plate from a sample rack that is similarly capable of storing a maximum of 96 sample tubes. At this time, one sample is dispensed to the well at one location (the wells on the master plate are positioned using the same layout as the sample tubes in the rack). Therefore, the master plate is managed as ‘child’ data of the sample rack. A reagent plate also has 96 wells (similar to a master plate), and since only one kind of reagent is used for each plate, the reagent plate is managed as ‘child’ data of the reagent tubes. The reason for keeping arrangement data for the arrangement of the reagent tubes in the reagent rack is that the automatic apparatus used for dispensing requires the use of reagent tubes stored in a rack. The assay plate 10AP has 384 wells, making it possible to perform simultaneous typing of 4 SNP for a maximum of 96 samples. This one assay plate 10AP is managed as four virtual plates in the data. One virtual plate is a plate used in performing typing of one SNP for a maximum of 96 samples, and one assay plate 10AP virtually comprises four of these plates.
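The idea of managing one 384-well assay plate as four virtual 96-well plates can be illustrated with a simple index mapping. The layout used here is an assumption: a 16 x 24 physical plate in which each 2 x 2 block of wells holds one sample typed for four SNPs. The actual arrangement is the one shown in FIG. 4, which may differ, and the function names are hypothetical.

```python
ROWS_96, COLS_96 = 8, 12  # one virtual plate: up to 96 samples


def assay_well(sample_row, sample_col, snp_index):
    """Map a sample position and SNP index (0-3) to a (row, col) well of the 16 x 24 plate."""
    if not (0 <= sample_row < ROWS_96 and 0 <= sample_col < COLS_96
            and 0 <= snp_index < 4):
        raise ValueError("position out of range")
    return 2 * sample_row + snp_index // 2, 2 * sample_col + snp_index % 2


def virtual_plate(row, col):
    """Inverse mapping: the virtual plate (SNP index) and sample position of a well."""
    return 2 * (row % 2) + col % 2, (row // 2, col // 2)
```

Under this assumed layout, every one of the 384 wells belongs to exactly one virtual plate, so typing data can be stored and checked per SNP.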

2) Checking typing data using ‘Hardy-Weinberg Equilibrium’: A SNP has two variations of nucleotide bases for the SNP position; for example, it has two alleles (allelic genes) such as A (adenine) or G (guanine). All of the chromosomes are paired, so the SNP positions exist at a total of two locations, one on each of the paired chromosomes. Therefore, there are the following three SNP patterns that are observed in the TaqMan assay: 1) Homozygote A-A of one allele (in this case, A), 2) Homozygote G-G of the other allele (G) and 3) Heterozygote A-G, which is a combination of the two alleles. Taking α to be the probability that the SNP on one chromosome has allele A, the probability of an A-A homozygote is α², the probability of a G-G homozygote is (1−α)², and the probability of an A-G heterozygote is 2α(1−α), and this relationship is called ‘Hardy-Weinberg Equilibrium’. The condition for establishing ‘Hardy-Weinberg Equilibrium’ is that data must be extracted at random from a sample group that is in equilibrium after several generations of random mating, and at the same time must be statistically extracted by average.

Here, when the value of the obtained sample data greatly deviates from the ‘Hardy-Weinberg Equilibrium’, it can be considered that either (1) the data is ‘contaminated’ data due to contamination of the assay that generated the data, or (2) the typed sample group was not statistically extracted at random.
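A deviation from ‘Hardy-Weinberg Equilibrium’ can be quantified with a chi-square goodness-of-fit test on the three genotype counts. The following is a minimal sketch rather than the procedure of the embodiment: the function names are hypothetical, α is estimated from the observed counts, and the 5% significance level with one degree of freedom (critical value 3.841) is an assumed choice.

```python
CRITICAL_5_PERCENT_DF1 = 3.841  # chi-square critical value, 1 degree of freedom, 5% level


def hwe_chi_square(n_aa, n_ag, n_gg):
    """Chi-square statistic of observed genotype counts against HWE expectations."""
    total = n_aa + n_ag + n_gg
    if total == 0:
        raise ValueError("no genotypes observed")
    alpha = (2 * n_aa + n_ag) / (2 * total)  # estimated frequency of allele A
    expected = (
        total * alpha ** 2,               # A-A homozygote
        total * 2 * alpha * (1 - alpha),  # A-G heterozygote
        total * (1 - alpha) ** 2,         # G-G homozygote
    )
    observed = (n_aa, n_ag, n_gg)
    # Classes with zero expectation also have zero observations and are skipped.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def deviates_from_hwe(n_aa, n_ag, n_gg, critical=CRITICAL_5_PERCENT_DF1):
    """True when the genotype counts deviate significantly from HWE."""
    return hwe_chi_square(n_aa, n_ag, n_gg) > critical
```

For instance, counts of 25 A-A, 50 A-G and 25 G-G match the expectation exactly, whereas 50 A-A, 0 A-G and 50 G-G give a large statistic and would be flagged.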

The object of SNP function analysis is to compare SNP data between sample groups that are separated according to whether or not they express a certain characteristic (disease, etc.), and to specify the SNP that is the cause for expressing that characteristic. In other words, when the SNPs that are the cause are typed, the alleles that are causally related to the characteristic of that group appear more often than expected, so those SNPs must have an allele distribution that deviates from the ‘Hardy-Weinberg Equilibrium’. In other words, finding the corresponding SNP in the checking of typing data using ‘Hardy-Weinberg Equilibrium’ is the object of the SNP function analysis and of this process model.

In order to specify the ‘target’ SNP, the SNPs having ‘contaminated’ data in (1) are removed from the SNP of the allele distribution that deviated from the ‘Hardy-Weinberg Equilibrium’.

The method of identifying the ‘contaminated’ data caused by contamination is a method of typing four SNPs using one assay plate 10AP.

FIG. 4 is a drawing showing an example of the arrangement pattern of SNPs on the assay plate 10AP of FIG. 3. The assay plate 10AP has 384 wells, and the layout of the typing SNPs is as shown in FIG. 4.

When there is failure of the assay itself due to contamination or problems in the PCR method, it is considered that problems will also occur for other SNPs on the same plate.

Since the chance of having selected a plurality of SNPs that are the cause is thought to be extremely small, in the case that a plurality of SNP data for the same plate deviate from the ‘Hardy-Weinberg Equilibrium’, typing is determined to have failed, and all of the data for that plate are discarded (or the assay is redone).
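The plate-level rule can be sketched as follows. This is an illustration under assumptions: the per-SNP chi-square values are taken as already computed, the critical value 3.841 (5% level, one degree of freedom) and the limit of one deviating SNP per plate are assumed settings, and the function name `review_plate` and the SNP identifiers are hypothetical.

```python
def review_plate(chi2_by_snp, critical_value=3.841, max_deviating=1):
    """Return (plate_valid, deviating_snps) for the SNPs typed on one assay plate."""
    deviating = [snp for snp, chi2 in chi2_by_snp.items() if chi2 > critical_value]
    # Several deviating SNPs on one plate point to a failed assay, not to several causes.
    plate_valid = len(deviating) <= max_deviating
    return plate_valid, deviating


valid, candidates = review_plate({"snp1": 0.4, "snp2": 9.7, "snp3": 1.2, "snp4": 0.1})
print(valid, candidates)
```

A single deviating SNP on an otherwise normal plate is kept as a candidate for further analysis, while two or more deviating SNPs cause the whole plate to be discarded or re-assayed.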

Furthermore, in order to avoid the possibility of chain imbalances between typed SNP pairs, the four SNPs are separated as much as possible, or in other words, the ‘scanning domain’ is divided into four sections 10PT (four well blocks), and one SNP to be typed is selected from each respective section.

FIG. 2C shows an example of the scanning domain for step S6 in FIG. 1. It illustrates how, in the processing cycle, the window 10w is moved from the start of the ‘scanning domain’ to the end and the SNP data contained in that window 10w are analyzed.

(f) Step 4 (haplotype analysis using typing data, step S6): In this process, the two concepts, haplotype and chain imbalance, are used to specify an SNP (that will in the end become the ‘target’ SNP itself) near the ‘target’ SNP.

The haplotype, which is a combination of allelic genes (SNP alleles) on one gamete (one of the paired chromosomes), is stochastically predicted based on data obtained from the SNP typing assay using the TaqMan method.

This stochastic prediction of the haplotype will be considered for the case of haplotypes formed by the following three SNPs: SNP#1 that takes A or G, SNP#2 that takes T or G and SNP#3 that takes T or C.

For example, for a sample X, in the case where SNP#1 takes the homozygote A-A, SNP#2 takes the heterozygote T-G and SNP#3 takes the heterozygote T-C, it is predicted that four haplotypes exist for the following two cases.

                        SNP#1  SNP#2  SNP#3  Probability
Case 1)  Chromosome 1   A      T      T      25%
         Chromosome 2   A      G      C      25%
Case 2)  Chromosome 1   A      T      C      25%
         Chromosome 2   A      G      T      25%

In this example, the probability for each haplotype is 25%.

Furthermore, in another sample Y, in the case where SNP#1 takes the homozygote A-A, SNP#2 takes the heterozygote T-G and SNP#3 takes the homozygote T-T, it is predicted that two haplotypes exist with each having a probability of 50%.

                        SNP#1  SNP#2  SNP#3  Probability
Case 1)  Chromosome 1   A      T      T      50%
         Chromosome 2   A      G      T      50%

Also, the haplotypes predicted from these two samples are as shown below.

SNP#1  SNP#2  SNP#3  Probability
A      T      T      25%/2 + 50%/2 = 37.5%
A      T      C      25%/2 = 12.5%
A      G      T      25%/2 + 50%/2 = 37.5%
A      G      C      25%/2 = 12.5%
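The equal-weight prediction used in the samples X and Y above can be reproduced with a short enumeration. This is a minimal sketch of that simple scheme only, in which every phase-consistent haplotype of a sample is weighted equally and samples are averaged; it is not a full statistical haplotype estimator, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def haplotype_probabilities(genotype):
    """Equal-weight haplotype probabilities for one sample.

    `genotype` is a sequence of allele pairs, one pair per SNP.
    """
    combos = ["".join(alleles) for alleles in product(*genotype)]
    counts = Counter(combos)
    return {hap: n / len(combos) for hap, n in counts.items()}


def pooled_probabilities(genotypes):
    """Average the per-sample haplotype probabilities over all samples."""
    pooled = Counter()
    for genotype in genotypes:
        for hap, prob in haplotype_probabilities(genotype).items():
            pooled[hap] += prob / len(genotypes)
    return dict(pooled)


sample_x = [("A", "A"), ("T", "G"), ("T", "C")]
sample_y = [("A", "A"), ("T", "G"), ("T", "T")]
print(pooled_probabilities([sample_x, sample_y]))
```

Run on samples X and Y, the pooled values are 37.5% for A-T-T and A-G-T and 12.5% for A-T-C and A-G-C, in agreement with the figures above.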

In the actual analysis, a continuous domain that contains a certain number of SNPs, specified in a range of several to several tens of SNPs, is defined as the window 10w, and the combinations and the probability of each haplotype emerging from the typing data of the SNPs in that window 10w (for all samples) are found statistically. When the number of SNPs contained in this window 10w is too large, the probability for each haplotype decreases and it becomes difficult to confirm the existence of chains or chain imbalances, so it is effective to define the window 10w such that it contains about 10 SNPs.

FIG. 5 is a drawing that shows an example of the analysis data for step S7 (step 5) in FIG. 1.

FIG. 5 shows the procedure for identifying domains in the analysis data in which changes occur in the statistical amount from haplotype analysis for the first half of the inspection.

(g) Step 5 (estimating the ‘marker’ SNPs, step S7): In this process, as shown in FIG. 5, domains near the ‘target’ SNP are estimated by finding domains in which the probability of a certain haplotype stands out. In other words: (1) The haplotypes in the window 10w are analyzed while moving the window 10w. (2) Changes in statistical amounts such as the number of ‘haplotypes’ according to the position of the window 10w are plotted. (3) Areas where remarkable differences are found in the statistical amounts between sample groups that are compared are extracted.
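The three operations listed above can be sketched as a sliding window followed by a comparison of per-window values between the two sample groups. This is an illustration under assumptions: the window size, step and threshold are free parameters, the per-window statistic is computed elsewhere, and the function names are hypothetical.

```python
def sliding_windows(snps, size, step=1):
    """Yield (start_index, window) for each window moved across the scanning domain."""
    for start in range(0, len(snps) - size + 1, step):
        yield start, snps[start:start + size]


def standout_windows(case_stats, control_stats, threshold):
    """Indices of windows whose statistic differs markedly between the two groups."""
    return [
        index
        for index, (case, control) in enumerate(zip(case_stats, control_stats))
        if abs(case - control) > threshold
    ]


print(list(sliding_windows(["s1", "s2", "s3", "s4", "s5"], size=3)))
print(standout_windows([5, 5, 2, 5], [5, 6, 6, 5], threshold=2))
```

In the second call only the third window stands out, so the SNPs it covers would be the area extracted in operation (3).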

It becomes possible to determine whether or not chain imbalances are seen between analyzed SNP groups from the haplotype data that were analyzed in step 4.

When no chain imbalances are seen between these SNPs, it is thought that the frequency of appearance of each SNP allele will become steady at an ‘average’ value, and since each of the SNPs is ‘independent’, the haplotypes that are statistically found from them will not be concentrated at a certain haplotype, but will be widely and thinly dispersed.

On the other hand, when chain imbalances are seen between the analyzed SNP groups, SNPs that statistically characterize the sample groups are contained in the groups, and the frequency of appearance of a certain allele in those SNPs increases. Also, it is predicted that the probability distribution of the haplotypes that are the results of the statistical analysis of these SNP data will concentrate on a certain haplotype (even when the ‘target’ SNP itself is not assayed).

As the method of identifying the concentration at this certain haplotype, besides comparing the frequency of appearance of each individual haplotype, the following are observed as the ‘statistical amount’: the total number of haplotypes predicted from the analysis data, the standard deviation of these haplotypes, and the ratio of the frequency of appearance of the upper-probability haplotype group to that of all of the haplotypes. These are compared between sample groups that are separated according to the expression of the characteristic, that is, whether or not a drug is effective or has side effects.
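
The three ‘statistical amounts’ listed above can be sketched in Python as follows. This is only one possible reading: the helper name `haplotype_summary`, the choice of the top two haplotypes as the ‘upper probability haplotype group’, and the use of the population standard deviation of the haplotype frequencies are assumptions, and the control-group frequencies are invented for the example.

```python
from statistics import pstdev


def haplotype_summary(freqs, top_k=2):
    """Summarize a haplotype frequency table.

    freqs maps haplotype string -> frequency of appearance.
    top_k is a hypothetical size for the upper-probability group.
    """
    values = sorted(freqs.values(), reverse=True)
    return {
        "n_haplotypes": len(values),                     # total number of haplotypes
        "std_dev": pstdev(values),                       # spread of the frequencies
        "top_share": sum(values[:top_k]) / sum(values),  # share of the top group
    }


# Concentrated distribution (values taken from the three-SNP example in this text).
case_freqs = {"ATT": 0.375, "AGT": 0.375, "ATC": 0.125, "AGC": 0.125}
# Widely and thinly dispersed distribution (hypothetical control group).
control_freqs = {h: 0.125 for h in
                 ["ATT", "AGT", "ATC", "AGC", "CTT", "CGT", "CTC", "CGC"]}

case_summary = haplotype_summary(case_freqs)
control_summary = haplotype_summary(control_freqs)
print(case_summary, control_summary)
```

A concentrated distribution shows fewer haplotypes, a larger standard deviation, and a larger top-group share than a dispersed one, which is the contrast the comparison between sample groups relies on.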

However, as described above, it is very difficult to select and directly type a ‘target’ SNP. This problem is solved by applying the chain imbalance between the ‘target’ SNP and the nearby SNPs, and by estimating the domain near the ‘target’ SNP. For a nearby SNP that is in chain imbalance with the ‘target’ SNP, the effect is weaker than when the ‘target’ SNP is analyzed directly, but it is expected that the probability distribution of the haplotypes will change in a similar way. This kind of nearby SNP is considered to be a ‘marker’ SNP for the ‘target’ SNP. In other words, the statistical amount of the sample that is the object of analysis (group with effect: Case group) is compared with the reference statistical amount of the reference sample (group having no effect: Control group), and when the difference exceeds a preset threshold value, it is determined that there was a change in the corresponding typing domain (estimated as the marker SNP). A domain reduced to a specified ratio (for example, 1/5 to 1/10) of that typing domain is then set as the new scanning domain, and the next processing cycle is performed.
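
The threshold comparison and the narrowing of the scanning domain described above might be sketched as follows. The function names, the coordinate convention (integer base positions, half-open interval), and the decision to center the new domain on the deviated position and clip it to the old domain are assumptions for illustration; the text only states that the new domain is a specified ratio (for example, 1/5 to 1/10) of the previous one.

```python
def next_scanning_domain(start, end, peak, shrink=5):
    """Return a new (start, end) domain that is 1/shrink of the old length,
    contains the peak position, and stays inside the old domain."""
    new_len = max(1, (end - start) // shrink)
    new_start = peak - new_len // 2
    # Clip so the new domain does not extend beyond the old one.
    new_start = max(start, min(new_start, end - new_len))
    return new_start, new_start + new_len


def narrow_if_changed(case_stat, control_stat, threshold,
                      start, end, peak, shrink=5):
    """Narrow the domain only when the Case/Control difference in the
    statistical amount exceeds the preset threshold."""
    if abs(case_stat - control_stat) <= threshold:
        return start, end  # no change detected; keep the current domain
    return next_scanning_domain(start, end, peak, shrink)


# Hypothetical values: a 1,000,000-base domain with a deviation at 420,000.
narrowed = narrow_if_changed(0.75, 0.25, threshold=0.3,
                             start=0, end=1_000_000, peak=420_000)
unchanged = narrow_if_changed(0.30, 0.25, threshold=0.3,
                              start=0, end=1_000_000, peak=420_000)
print(narrowed, unchanged)
```

Repeating this call with each cycle's output as the next cycle's input gives the gradual narrowing that the processing cycle describes.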

FIG. 6 is a drawing that shows an example of the analysis data in step S7 shown in FIG. 1. FIG. 6 shows the procedure for determining the new ‘scanning domain’ for the latter half of the inspection of analysis data in step 5.

A domain where a certain haplotype stands out is found from the typing data of the ‘marker’ SNP without directly typing the ‘target’ SNP, and the process returns to step 1 to set that domain as the next ‘scanning domain’, and the process is repeated.

Also, in that cycle, after all of the SNPs in the ‘scanning domain’ are typed, the process advances to the next step 6, and the SNP positions that converge at that certain haplotype are set as the ‘target’ SNP.

The purpose of this step is to estimate the ‘marker’ SNPs. By repeating the cycle of narrowing down the ‘scanning domain’ near the newly estimated ‘marker’ SNP, the ‘marker’ SNP can be brought close to the ‘target’ SNP. Even though the ‘target’ SNP may be derived from analysis before the last cycle, at that point it is still not possible to set that SNP as the ‘target’ SNP, so it is treated as a ‘marker’ SNP.

(h) Step 6 (specifying the ‘target’ SNP, step S8): The purpose of this process is to correlate the ‘target’ SNP that is selected from the overall process with the drug responsiveness of the newly developed drug, and quantitatively derive the degree of correlation.

In this process, a final inspection is performed in the sample group for the allele frequency of the SNP that is set as the ‘target’ SNP. (A Chi-square test or method of maximum likelihood is used.)
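
As one hedged illustration of such a final inspection, the sketch below applies a Pearson Chi-square test to a 2×2 table of allele counts in a Case group and a Control group. The allele counts are hypothetical, the function name is invented, and the 2×2 layout is an assumption; the text does not specify how the test is set up. For one degree of freedom, the p-value can be computed with the standard library as erfc(sqrt(chi2/2)).

```python
import math


def allele_chi_square(case_a, case_b, control_a, control_b):
    """Pearson Chi-square test (1 degree of freedom, no continuity
    correction) on a 2x2 table of allele counts."""
    n = case_a + case_b + control_a + control_b
    numerator = n * (case_a * control_b - case_b * control_a) ** 2
    denominator = ((case_a + case_b) * (control_a + control_b)
                   * (case_a + control_a) * (case_b + control_b))
    chi2 = numerator / denominator
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, df = 1
    return chi2, p_value


# Hypothetical allele counts: Case group 60 A / 40 B, Control group 40 A / 60 B.
chi2, p_value = allele_chi_square(60, 40, 40, 60)
print(chi2, p_value)
```

A small p-value here would indicate that the allele frequency of the candidate SNP differs between the two groups more than chance would explain.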

Furthermore, it is effective to perform a comparison with SNP data other than that of the sample group, particularly data held by The Institute of Medical Science, The University of Tokyo, on SNPs representative of the Japanese people. Moreover, some genome research businesses are moving to sell SNP-related databases, so these can also be used.

Evaluation tests are performed using a Chi-square test or correlation analysis to evaluate whether the expected results are obtained at a certain percent of probability, or whether severe side effects occur at a certain percent of probability.

Here, the method of identifying contaminated data will be explained in detail.

1) SNP allele distribution in a plate that was set according to the typing results (Typing is performed for four kinds of SNP, out of a maximum of 96 (actually up to 92), on one plate; here, however, this means the allele distribution for each SNP.)

Here, the ‘allele distribution’ is the number of samples that take each of the three values in a group of SNP typing data (the SNP typing data of the samples for which typing was performed for the same SNP on the same plate), where each datum is expressed by one of three values, such as AA/AB/BB or AA/Aa/aa. (For example, AA:23, AB:54, BB:15)

2) Ideal SNP allele distribution based on Hardy-Weinberg Equilibrium: ‘Hardy-Weinberg Equilibrium’ is said to be the state where, when the frequency of ‘AA’ is taken to be α², the frequency of ‘BB’ is (1−α)², and the frequency of ‘AB’ is 2α(1−α). (Here, [frequency of ‘AA’]>[frequency of ‘BB’]).

(The probability of a single ‘A’ (=frequency) is α, so when both are ‘A’, the frequency of ‘AA’ becomes α²; the probability of ‘B’, that is, of not being ‘A’, is 1−α, so the probability of ‘BB’ is (1−α)²; and the probability of ‘AB’, which is the total probability of the two cases ‘A’×‘B’ and ‘B’×‘A’, is 2α(1−α).)

The ‘ideal SNP allele distribution based on Hardy-Weinberg Equilibrium’ is calculated with the frequency of ‘AA’ taken to be α², and the frequencies of the remaining ‘BB’ and ‘AB’ being calculated as (1−α)² and 2α(1−α), respectively. In other words, in example 1), the frequency of ‘AA’ is 23/92=25%, and the probability of a single ‘A’ becomes 50%. Therefore, the frequency of ‘BB’ is also 25% and its allele distribution is 92×25%=23, while the frequency of ‘AB’ is 50% and its distribution is 92×50%=46; this becomes the ‘ideal SNP allele distribution based on Hardy-Weinberg Equilibrium’.
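
The calculation above can be checked with a short Python sketch. It uses the example counts AA:23, AB:54, BB:15, which total 92 samples, and follows the procedure in this text of deriving α from the observed ‘AA’ frequency alone; the function name is an assumption introduced for illustration.

```python
import math


def ideal_distribution_from_aa(n_aa, n_total):
    """Return the ideal (AA, AB, BB) counts under Hardy-Weinberg Equilibrium,
    taking the observed AA frequency as alpha squared."""
    alpha = math.sqrt(n_aa / n_total)          # probability of a single 'A'
    freq_aa = alpha ** 2
    freq_ab = 2 * alpha * (1 - alpha)          # 'A'x'B' plus 'B'x'A'
    freq_bb = (1 - alpha) ** 2
    return n_total * freq_aa, n_total * freq_ab, n_total * freq_bb


observed = (23, 54, 15)                        # AA, AB, BB from the example
ideal = ideal_distribution_from_aa(observed[0], sum(observed))
print(ideal)
```

The three frequencies sum to one, since α² + 2α(1−α) + (1−α)² = (α + (1−α))².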

3) Comparison of the aforementioned two allele distributions: The two SNP allele distributions of 1) and 2) are checked by a Chi-square test to determine whether or not they have a ‘significant statistical difference’.

In the previous examples, AA, AB and BB are as follows.

    1) 23, 54, 15
    2) 23, 46, 23

Here, the Chi-square test gives a probability of 12.41%, which is greater than the 5% level below which a ‘significant statistical difference’ is recognized, so it is determined that there is no significant difference.
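
The 12.41% figure can be reproduced with the sketch below, under the assumption that the test has two degrees of freedom (three genotype classes minus one); that choice is inferred from the number and is not stated in the text. For two degrees of freedom the Chi-square survival function reduces to exp(−χ²/2), so no external library is needed. A textbook Hardy-Weinberg test that estimates the allele frequency from the data would normally use one degree of freedom and give a different probability.

```python
import math


def chi_square_df2(observed, expected):
    """Pearson Chi-square statistic and its p-value for 2 degrees of freedom."""
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.exp(-chi2 / 2)   # survival function of chi-square with df = 2
    return chi2, p_value


observed = (23, 54, 15)   # 1) typed distribution: AA, AB, BB
expected = (23, 46, 23)   # 2) ideal distribution under Hardy-Weinberg Equilibrium

chi2, p_value = chi_square_df2(observed, expected)
significant = p_value < 0.05
print(round(chi2, 4), round(p_value * 100, 2), significant)
```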

After identifying contaminated data in this way:

4) When a ‘significant statistical difference’ is found in the result of 3) for two or more of the SNPs that were simultaneously typed on that plate, typing of that entire plate is determined to have failed, and that typing data is discarded.

5) When a ‘significant statistical difference’ is found in only one of the SNPs on the same plate, typing is performed again for just the SNP for which the significant difference was found. (When the same analysis result is obtained, that typing data is stored as being ‘correct’. When the ‘significant difference’ in the re-executed typing result is not removed, typing is performed for a third time. This procedure is repeated, and typing is performed until the same result is obtained consecutively.)
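
The plate-level decision in 4) and 5) can be sketched as a small function. The SNP identifiers, the returned labels, and the p-values other than 12.41% are hypothetical; only the rule itself (two or more significant SNPs: discard the plate; exactly one: retype that SNP) comes from the text, and the retyping loop is not modeled.

```python
def plate_decision(p_values, significance=0.05):
    """Decide what to do with one plate.

    p_values maps SNP identifier -> Chi-square probability from step 3).
    Returns (action, list of SNPs with a significant difference).
    """
    flagged = [snp for snp, p in p_values.items() if p < significance]
    if len(flagged) >= 2:
        return "discard_plate", flagged   # rule 4): whole plate failed
    if len(flagged) == 1:
        return "retype_snp", flagged      # rule 5): retype only that SNP
    return "accept", flagged


# Four SNPs typed on one plate (hypothetical identifiers and values).
ok_plate = {"snp1": 0.1241, "snp2": 0.40, "snp3": 0.62, "snp4": 0.33}
one_bad = {"snp1": 0.01, "snp2": 0.40, "snp3": 0.62, "snp4": 0.33}
two_bad = {"snp1": 0.01, "snp2": 0.02, "snp3": 0.62, "snp4": 0.33}

print(plate_decision(ok_plate), plate_decision(one_bad), plate_decision(two_bad))
```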

The method of specifying SNP of a first embodiment of the invention is as was described above, and it has the following effect.

When identifying an SNP from among the typed SNPs that is related to disease susceptibility or drug responsiveness, by gradually narrowing down the base sequence domain that is the object of the analysis from a large domain to a more localized domain, and by performing typing of the SNPs using a quality controlled process, it is possible to finally identify a related SNP.

(Embodiment 2)

Next, a method for specifying SNP of a second embodiment of the invention will be explained.

When determining the ‘scanning domain’ in step 1 (step S3 in FIG. 1), it is possible to set the initial ‘scanning domain’ on the gene level, which is more detailed than the larger chromosome level. In this way, typing can be started from a narrow initial ‘scanning domain’, so the scanning is more efficient.

When setting the ‘typing’ SNPs in step 2 (step S4 in FIG. 1), if the interval between observed SNP positions is too large (particularly in the early stages of the ‘scanning domain’), it is almost impossible to observe any chain imbalance. It is necessary to select ‘typing’ SNPs that have an interval shorter than the physical distance at which it is possible to observe this chain imbalance.

In the initial ‘scanning domain’, this interval is such that one SNP is selected for every 1,000 SNPs. In other words, the initial value is set such that one SNP is selected for every 100,000 nucleotide bases. However, this is just a calculated value, and depending on the degree of the chain imbalance seen near that position, there are cases in which a smaller value is taken (one SNP is selected for every several tens of SNPs).

In the estimation of the ‘marker’ SNP in step 5 (step S7 in FIG. 1), besides the method of comparing changes in ‘statistical amounts’ such as the total number or frequency of appearance of haplotypes, it is possible to determine whether or not there is a chain or chain imbalance by finding the SNP pattern that is common to the haplotypes having a high level of probability.

In the example of haplotype probability explained in the first embodiment, the sums of the haplotype probabilities for sample X and sample Y were as shown below.

    SNP#1  SNP#2  SNP#3  Probability
    A      T      T      25%/2 + 50%/2 = 37.5%
    A      G      T      25%/2 + 50%/2 = 37.5%
    A      T      C      25%/2 = 12.5%
    A      G      C      25%/2 = 12.5%

In this example, for this group of haplotypes it is shown that SNP#1 takes allele A and SNP#3 takes allele T at a probability of 75%, so it can be estimated that there is a link between SNP#1 and SNP#3.

Also, in a more complicated example:

    SNP#1  SNP#2  SNP#3  SNP#4  SNP#5  Probability
    A      T      T      A      T      30%
    A      T      T      A      G      25%
    C      T      T      A      T      10%
    C      T      T      A      G      10%
    A      T      T      C      T      5%
    A      T      T      C      G      5%
    A      G      C      A      T      3%

A haplotype pattern is observed in which SNP#1 has allele A, SNP#2 has allele T, SNP#3 has allele T and SNP#4 has allele A at a probability of 55%. Furthermore, a haplotype pattern is observed in which SNP#2 has allele T, SNP#3 has allele T and SNP#4 has allele A at a probability of 75%, and a haplotype pattern is observed in which SNP#2 has allele T and SNP#3 has allele T at a probability of 85%. From this it is determined from the data of this sample group that a chain can be seen centered on SNP#2 and SNP#3.
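
The pattern probabilities quoted above can be reproduced by summing the probabilities of all haplotypes that match a partial allele pattern. The following Python sketch uses the seven haplotypes of the five-SNP example; the function name and the representation of a pattern as a mapping from 1-based SNP number to allele are assumptions for illustration, and probabilities are kept as integer percentages to avoid rounding noise.

```python
# Haplotypes over SNP#1..SNP#5 and their probabilities in percent,
# as listed in the five-SNP example.
HAPLOTYPES = {
    "ATTAT": 30, "ATTAG": 25, "CTTAT": 10, "CTTAG": 10,
    "ATTCT": 5, "ATTCG": 5, "AGCAT": 3,
}


def pattern_probability(haplotypes, pattern):
    """Sum the probabilities of haplotypes matching a partial pattern.

    pattern maps a 1-based SNP number to the required allele,
    e.g. {2: "T", 3: "T"} for SNP#2 = T and SNP#3 = T.
    """
    return sum(
        prob for hap, prob in haplotypes.items()
        if all(hap[snp - 1] == allele for snp, allele in pattern.items())
    )


p_1234 = pattern_probability(HAPLOTYPES, {1: "A", 2: "T", 3: "T", 4: "A"})
p_234 = pattern_probability(HAPLOTYPES, {2: "T", 3: "T", 4: "A"})
p_23 = pattern_probability(HAPLOTYPES, {2: "T", 3: "T"})
print(p_1234, p_234, p_23)
```

The shorter the shared pattern, the higher its probability, which is why the chain is described as centered on SNP#2 and SNP#3.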

The threshold value of the appearance probability that decides whether or not there is a chain is related to the number of SNPs scanned in the window 10 w. As the number of SNPs increases, the number of haplotype variations increases and the probability that each individual haplotype will be observed decreases, so the threshold value also becomes lower. When a window 10 w in which 10 SNPs are observed is used, a threshold value of 70% is considered to be appropriate. The domain (window size) scanned by the window 10 w varies according to the object; however, about 3 to 25 SNPs is generally considered to be appropriate.

The method of specifying SNP of a second embodiment of the invention is as described above, and has the following effect in addition to the effect of the first embodiment.

It is possible to specify the ‘target’ SNP or the domain near it by determining whether there are different chains existing between two sample groups that are compared, or whether there is a chain that exists in only one sample group.

The embodiments were explained using haplotype analysis; however, it is evident that general statistical analysis could also be used. Also, the invention is not limited to these embodiments, and it can be applied to any suitable SNP specifying method. Moreover, in order to realize the embodiments, it is possible to construct a system for specifying SNP that comprises a chromosome-level or DNA-level SNP typing process apparatus and a computer that performs statistical analysis, and that is capable of performing a series of processes. Also, the number, position and shape of the components are not limited to those of the embodiments described above, and the invention can be embodied by using any suitable number, position or shape. In the drawings, the same reference numbers are used for identical component elements.

Industrial Applicability

The present invention is constructed as described above and so has the following effect.

As was explained above, with this invention, by estimating an SNP to be a marker, the base sequence domain that is the object of analysis is gradually narrowed down from a large domain to a more localized domain (statistical amounts for patients and non-patients are compared and the SNP domain is narrowed down), and furthermore, by typing the SNPs using a quality controlled process, it is possible to finally specify an SNP related to disease susceptibility or drug responsiveness.

1-15. (canceled)
16. A method of specifying SNP related to disease susceptibility or drug responsiveness and comprising: a first step of defining a continuous domain that contains a specified number of SNPs determined by a range of several to several tens as a window, and setting a scanning domain beforehand in said window that will be the object of SNP analysis; a second step of gradually narrowing down said scanning domain to a localized domain that contains a target SNP; and a third step of specifying said target SNP from said narrowed down localized domain.
17. The method of specifying SNP of claim 16 wherein said second step comprises a step of setting a marker SNP for specifying said target SNP and gradually narrowing down said scanning domain.
18. The method of specifying SNP of claim 17 wherein said second step uses statistical analysis such as haplotype analysis to set said marker SNP.
19. The method of specifying SNP of claim 18 wherein said first step comprises: a step of setting the scanning domain of said window in a genome domain that is limited to genes whose functions are clearly known or chromosomes whose functions can be predicted; and said second step comprises: a fourth step of selecting a group of SNP to be typed from said scanning domain and performing SNP typing using a wet process; a fifth step of finding the probability of appearance of all combinations of said haplotype analysis in said scanning domain based on typing data of said SNP typing as a statistical amount; and a sixth step of comparing the found said statistical amount with a preset or estimated reference statistical amount, and when there is significant deviation between said statistical amount and said reference statistical amount that exceeds a preset threshold, determining that said marker SNP is contained in the domain corresponding to the deviated position that exceeds said threshold value.
20. The method of specifying SNP of claim 19 wherein said third step comprises: a seventh step of increasing the specified ratio of the number of SNPs to be the object of typing in the selection of the SNP group in said fourth step when said significant deviation is less than a first threshold value, and then repeating said fifth step; an eighth step of setting a new scanning domain from said scanning domain that has been decreased by a specified ratio such that it contains the position of the deviated peak when said significant deviation is greater than said first threshold value, but less than a second threshold value, and then repeating said fifth step; and a ninth step of determining that said marker SNP is contained in the domain corresponding to the deviated position that exceeds said second threshold value when said significant deviation exceeds said second threshold value, setting a new scanning domain from said scanning domain that has been decreased by a specified ratio such that it contains the position of the deviated peak, and then repeating said fifth step.
21. The method of specifying SNP of claim 20 wherein said ninth step comprises a step of setting SNPs that include the target SNP for which all DNA samples are typed when the number of SNPs in a selected group is less than a specified number.
21. The method of specifying SNP of claim 20 wherein said seventh step comprises a step of determining that the target SNP is not contained and stopping the process when the number of times the process of said fifth step is performed exceeds a specified number of times.
22. The method of specifying SNP of claim 20 in which said eighth step comprises a step of determining that the target SNP is not contained and stopping the process when the number of times the process of said fifth step is performed exceeds a specified number of times.
23. The method of specifying SNP of claim 16 further comprising defining a continuous domain that contains a specified number of SNPs determined by a range of several to several tens as a window, and statistically finding the probability of appearance of each combination of haplotypes from SNP typing data (all samples) in said window.
24. The method of specifying SNP of claim 16 wherein the number of said SNP is ten.
25. The method of specifying SNP of claim 16 wherein the number of said SNP is three to five.
26. The method of specifying SNP of claim 16 further comprising moving said window from the start to the end of the ‘scanning domain’ during the processing cycle, and analyzing the SNP data contained in said window.
27. A computer program that can be read by a computer that can execute the processing of the method of specifying SNP of claim 16 wherein all of the steps of claim 16 are coded.